Skip to content

[slop]fix(workflow-engine): only commit step state after success#5010

Draft
abcxff wants to merge 1 commit into
05-06-fix_rivetkit_bind_methods_through_createwritethroughproxyfrom
05-09-fix_workflow-engine_only_commit_step_state_after_success
Draft

[slop]fix(workflow-engine): only commit step state after success#5010
abcxff wants to merge 1 commit into
05-06-fix_rivetkit_bind_methods_through_createwritethroughproxyfrom
05-09-fix_workflow-engine_only_commit_step_state_after_success

Conversation

@abcxff
Copy link
Copy Markdown
Contributor

@abcxff abcxff commented May 9, 2026

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented May 9, 2026

🚅 Deployed to the rivet-pr-5010 environment in rivet-frontend

Service Status Web Updated (UTC)
frontend-cloud ❌ Build Failed (View Logs) Web May 13, 2026 at 8:20 pm
frontend-inspector ❌ Build Failed (View Logs) Web May 13, 2026 at 8:20 pm
kitchen-sink ❌ Build Failed (View Logs) Web May 13, 2026 at 8:18 pm
website 😴 Sleeping (View Logs) Web May 12, 2026 at 3:59 am
ladle ❌ Build Failed (View Logs) Web May 9, 2026 at 3:40 pm
mcp-hub ✅ Success (View Logs) Web May 9, 2026 at 3:40 pm

Copy link
Copy Markdown
Contributor Author

abcxff commented May 9, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@abcxff abcxff changed the title fix(workflow-engine): only commit step state after success [slop]fix(workflow-engine): only commit step state after success May 9, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 9, 2026

Code Review: fix(workflow-engine): only commit step state after success

Note: This PR is still in Draft status with an empty description/checklist.

Overview

This PR fixes a correctness bug in workflow-engine where actor state mutations inside a failing step were being persisted even though the step needs to be retried. On retry, the step would execute against incorrectly mutated state. The fix:

  1. Introduces RAW_STATE_SYMBOL to bypass the write-through proxy and read raw actor state.
  2. Wraps every step() and tryStep() call in #withActorAccessAndStateRollback, which snapshots state+vars before the step and rolls them back on any error.
  3. Removes the per-branch flushStorage() calls in step failure catch blocks, consolidating to the single flush in setRetryState / executeWorkflow's unrecoverable-error path.
  4. Removes the stepData.error fallback (justified by the new test).

The core idea is correct and addresses a real workflow replay bug.


Bug: Breaks stateless actors using workflow steps

withActorAccessAndStateRollback unconditionally calls this.#runCtx[RAW_STATE_SYMBOL]() (workflow/context.ts:618), but ActorContextHandleAdapter[RAW_STATE_SYMBOL] throws stateNotEnabledError() when the actor has no state configured (native.ts:2460-2465).

This means any actor that uses workflow steps but does not have a state field will crash on the snapshot call, breaking all ctx.step() invocations regardless of whether the step touches state. The guard needs to handle the state-disabled case, for example by returning undefined and skipping the rollback assignment when #stateEnabled is false.

Concern: structuredClone(vars) can throw for non-cloneable objects

vars are runtime-in-memory (not persisted), so they can legitimately hold class instances, Promises, or WeakRefs. structuredClone throws a DataCloneError on these, crashing the snapshot before the step runs. Worth documenting that vars must be structured-cloneable when workflows are used, or guarding the clone.

Minor: Notification fires before flush for timeout/critical errors

Previously flushStorage() was called before notifyStepError for StepTimeoutError/CriticalError. Now the notify fires before the eventual top-level flush in executeWorkflow. This slightly widens the crash window between notification and persistence (duplicate notification possible on replay). Not a durability regression, but worth a comment.


Code Quality

Positive changes:

  • Consolidating entry.kind.data.error and entry.dirty = true before all the if-branches eliminates three identical blocks.
  • Removing the double-flush on failure (catch block plus setRetryState) is correct; the top-level flush in executeWorkflow guarantees persistence on all error paths.
  • Removing the stepData.error fallback is justified since the new test confirms the error is always written to the step entry.

Issues:

  1. Misleading test name (steps.test.ts:573): 'should not commit step error data to entry on failure' but the test asserts expect(entry.kind.data.error).toBe('Error: step failed'), i.e. the error is committed. Rename to 'should commit error but not output to step entry on failure'.

  2. Bare catch {} in the new test's first-run block silently swallows unexpected errors. Add a comment explaining what error is expected.

  3. // @ts-nocheck in workflow/context.ts: the RAW_STATE_SYMBOL call is unverified by the type system. The interface declares the method returning TState, with TState = never for stateless actors the type system would surface the mismatch if checking were enabled.

  4. NativeConnAdapter[RAW_STATE_SYMBOL] (native.ts:1162): added for interface completeness. The workflow rollback path only flows through ActorContextHandleAdapter, so this is dead code in practice. Fine as-is.


Test Coverage

  • New test is a good regression guard for the flush behavior; see naming concern above.
  • Removed StripStepHistoryErrorDriver test is a coherent removal - it tested the stepData.error fallback that was deliberately deleted.
  • Missing: no test covers stateless actor plus workflow steps. Once the crash is fixed, a test like 'should not crash when actor has no state and step fails' would prevent regression.
  • Loop expectation change (toBe(2) to toBe(3)) appears correct: crash at iteration 3, state now only persisted after successful Loop.continue at end of iteration 2, so iterations 3-5 need to re-execute on resume.

Summary

Area Finding
Core logic Correct - rollback on step failure fixes a real replay bug
Bug Stateless actors with workflow steps crash on RAW_STATE_SYMBOL() before step runs
Concern structuredClone(vars) throws if vars contains non-cloneable objects
Minor Notification fires before flush for exhausted/timeout errors (slightly wider crash window)
Test name 'should not commit step error data' contradicts what the test actually asserts
PR hygiene Description, type checkboxes, and test description are all empty; PR is still in draft

The core approach is sound. Please address the stateless actor crash (and consider the vars clone concern) before merging.

@abcxff abcxff changed the base branch from 05-07-fix_pegboard_validate_drain_grace_period_request_lifespan to graphite-base/5010 May 11, 2026 03:41
@abcxff abcxff force-pushed the graphite-base/5010 branch from 56af1d1 to aac9634 Compare May 12, 2026 12:55
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from cc1baa6 to 9304f04 Compare May 12, 2026 12:55
@abcxff abcxff changed the base branch from graphite-base/5010 to 05-07-fix_pegboard_validate_drain_grace_period_request_lifespan May 12, 2026 12:55
@abcxff abcxff changed the base branch from 05-07-fix_pegboard_validate_drain_grace_period_request_lifespan to graphite-base/5010 May 12, 2026 13:21
@abcxff abcxff mentioned this pull request May 12, 2026
11 tasks
@abcxff abcxff force-pushed the graphite-base/5010 branch from aac9634 to 5f7bac7 Compare May 12, 2026 15:24
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 9304f04 to 0668fa7 Compare May 12, 2026 15:24
@abcxff abcxff changed the base branch from graphite-base/5010 to 05-12-temp_fix_wasm-pack May 12, 2026 15:24
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 0668fa7 to 2c029c2 Compare May 12, 2026 17:46
@abcxff abcxff mentioned this pull request May 12, 2026
11 tasks
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 2c029c2 to 9b64fa5 Compare May 12, 2026 18:03
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 9b64fa5 to 0eb8f51 Compare May 12, 2026 18:44
@abcxff abcxff force-pushed the 05-13-fix_rivetkit_decode_bridge_errors_with_actor_null branch from 67b4f92 to e402a50 Compare May 13, 2026 17:50
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 8def8de to e67249c Compare May 13, 2026 17:50
@abcxff abcxff force-pushed the 05-13-fix_rivetkit_decode_bridge_errors_with_actor_null branch from e402a50 to 7b5d681 Compare May 13, 2026 18:04
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from e67249c to 31975cf Compare May 13, 2026 18:04
@abcxff abcxff force-pushed the 05-13-fix_rivetkit_decode_bridge_errors_with_actor_null branch from 7b5d681 to 8625f84 Compare May 13, 2026 18:31
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 31975cf to 26b0baa Compare May 13, 2026 18:31
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 13, 2026

[Updated review - see below for latest]

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 13, 2026

PR Review: fix(workflow-engine): only commit step state after success

Note: This PR is still in Draft status with an empty description/checklist.

Overview

Two related correctness fixes in the workflow engine:

  1. State rollback on step failure - Snapshots actor state and vars before a workflow step runs and restores them if the step throws, preventing failed steps from leaking mutations into persisted actor state.
  2. Deferred error flush - Removes flushStorage() calls from individual error branches, hoists the shared entry.kind.data.error / entry.dirty assignments above the if/else chain, and relies on the outer execution loop to flush.

Positives

  • RAW_STATE_SYMBOL is the right approach: bypassing the write-through proxy for the snapshot avoids cloning a proxied object. Consistent implementation on both NativeConnAdapter and ActorContextHandleAdapter.
  • Moving entry.kind.data.error = String(error) and entry.dirty = true above the if/else chain eliminates three nearly-identical copy-paste blocks. Good deduplication.
  • The rollback assignments this.#runCtx.state = stateSnapshot and this.#runCtx.vars = varsSnapshot land correctly: set state() calls #writeState({ scheduleSave: true }) and the pending save is overwritten with snapshot data before the next flush checkpoint.

Issues

Misleading test name (clear bug)

The new test is titled "should not commit step error data to entry on failure" but the assertions directly contradict the name:

// The error should be committed for inspection.
expect(entry.kind.data.error).toBe("Error: step failed");

A better name: "should commit error but not output to step entry on failure".

Crash safety regression from flushStorage() removal

Before this PR, each error branch flushed immediately, persisting metadata.attempts, metadata.status, and entry.kind.data.error before rethrowing. Now those are only flushed at the next outer-loop checkpoint (e.g., before yielding/sleeping).

The crash window is:

  1. metadata.attempts++ and metadata.dirty = true set before the try block
  2. Step fails - error branch runs, no flushStorage()
  3. Process crashes before the outer loop flushes
  4. On recovery: metadata.attempts is stale (under-counted), the step retries with wrong attempt count, and may silently exceed maxRetries

This window was effectively zero pre-PR. A comment on the error catch block documenting that the caller (outer loop) must flush before sleeping/terminating would make the invariant explicit.

Migration risk from stepData.error ?? metadata.error to metadata.error

The old fallback handled partial-write scenarios where the history entry was written but entry.kind.data.error was absent (crash mid-flush). Since both fields are now set before any flush, this fallback is structurally redundant for new executions. However, workflows that crashed mid-flush under the old code may have metadata.error set but entry.kind.data.error absent in storage. Replaying with new code produces undefined error messages on StepExhaustedError for those in-flight workflows. Low risk for non-production deployments, worth noting for production.

structuredClone edge case (minor)

structuredClone throws on non-serializable values (class instances, Map, Set, functions). If actor state or vars contain such values, the snapshot itself throws before #withActorAccess runs, leaving the step un-executed with a confusing error. Zod validation on user-defined schemas largely prevents this, but a one-line comment noting the assumption would help.


Test Coverage

  • Removing StripStepHistoryErrorDriver is justified: the scenario it covered (metadata has error, history entry does not) is now structurally prevented.
  • The loop test expectation change (iterationsExecuted[0]: 2 to 3) is correct: with state rollback, iteration 2's state commits durably after Loop.continue, so iteration 3 is the replay start on crash-recovery.
  • Gap: No test covers the crash-recovery scenario for under-counted retries (process dies after step fails, before the outer flush). This is the main correctness risk introduced by the flushStorage() removal.

Summary

The RAW_STATE_SYMBOL snapshot/restore approach is clean and the deduplication is a clear improvement. Three things before merging:

  1. Fix the test name (it says "should not commit" but asserts that it does).
  2. Add a comment on the catch block documenting the caller-must-flush invariant, or add a test for crash-recovery with mis-counted retries.
  3. Note the partial-write migration risk for any workflows in-flight under the old code.

@abcxff abcxff force-pushed the 05-13-fix_rivetkit_decode_bridge_errors_with_actor_null branch from 8625f84 to 46973f2 Compare May 13, 2026 18:49
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 26b0baa to 1344263 Compare May 13, 2026 18:49
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 1344263 to ec51340 Compare May 13, 2026 19:11
@abcxff abcxff force-pushed the 05-13-fix_rivetkit_decode_bridge_errors_with_actor_null branch from 46973f2 to c5ef70f Compare May 13, 2026 19:11
@abcxff abcxff changed the base branch from 05-13-fix_rivetkit_decode_bridge_errors_with_actor_null to graphite-base/5010 May 13, 2026 20:07
@abcxff abcxff force-pushed the graphite-base/5010 branch from c5ef70f to 8c70217 Compare May 13, 2026 20:07
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from ec51340 to 42fadeb Compare May 13, 2026 20:07
@abcxff abcxff changed the base branch from graphite-base/5010 to 05-06-fix_rivetkit_bind_methods_through_createwritethroughproxy May 13, 2026 20:08
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 42fadeb to 2258fbe Compare May 13, 2026 20:09
@abcxff abcxff force-pushed the 05-06-fix_rivetkit_bind_methods_through_createwritethroughproxy branch from 8c70217 to 00af8a0 Compare May 13, 2026 20:09
@abcxff abcxff force-pushed the 05-06-fix_rivetkit_bind_methods_through_createwritethroughproxy branch 2 times, most recently from ad1ebe6 to 6d4d578 Compare May 13, 2026 20:15
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 2258fbe to aa538e5 Compare May 13, 2026 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant